11

Packaging and Distributing Python Code

In this chapter, we will focus on ways of packaging and shipping various types of Python packages. We will consider complete applications intended for end users as well as libraries that are typically consumed only by software developers.

Everyone who writes software does so for a reason. You may be a hobbyist who makes applications for fun and wants to share them with friends for their amusement. You may be a scientist or researcher who solves an important problem and wants to share code with other people to make their lives easier. Or you may be a professional who writes code for a living and wants to make an application or service available to paying customers.

Every reason for writing code is valid, but each one usually comes with its own preferred way of distributing the software. In this chapter, we will discuss three main scenarios:

  • Packaging and distributing libraries
  • Packaging applications and services for the web
  • Creating standalone executables

We will focus first on packaging and distributing libraries, as this scenario underpins the other packaging and distribution flows. But before we continue on this topic, let's first consider the technical requirements for this chapter.

Technical requirements

The following are Python packages that are mentioned in this chapter that you can download from PyPI:

  • twine
  • wheel
  • cx_Freeze
  • py2exe
  • pyinstaller

Information on how to install packages is included in Chapter 2, Modern Python Development Environments.

The code files for this chapter can be found at https://github.com/PacktPublishing/Expert-Python-Programming-Fourth-Edition/tree/main/Chapter%2011.

Packaging and distributing libraries

A software library is a reusable piece of code that can be used as a component of a larger application or another library. Libraries usually focus on solving a limited set of problems in a specific technical area, but there is no limit on library size. For the purposes of this chapter, we will consider frameworks to be libraries too. That's because frameworks can also be understood as components of an application, although on a larger and more generic scale.

Libraries in Python are distributed in the form of packages (or modules). We've been using them throughout the book already. Most of the packages that we've obtained from PyPI in previous chapters can in fact be considered libraries. Most of the open-source Python libraries are distributed through PyPI and that's why we will discuss this topic through the prism of distributing open-source packages.

You should know how to create packages even if you are not interested in distributing your code as open-source. Knowing how to make your own packages will give you more insight into the packaging ecosystem and will help you to work with third-party code that is available on PyPI (which you are probably using already).

Python packaging can be a bit overwhelming at first. The main reason for that is the confusion about the proper tools for creating Python packages. However, once you create your first package, you will see that it is not as hard as it looks. Also, knowing the proper, state-of-the-art packaging tools helps a lot.

But before we get to the state-of-the-art tools, let's take a closer look at the anatomy of a Python package.

The anatomy of a Python package

The minimal distributable piece of Python code is a module, which is a single source file ending with the .py extension. A collection of modules is called a package. While you could theoretically distribute your Python packages and modules as a raw source code bundle and let your users use it through the Python interpreter, it would be really problematic for non-technical people. Even developers expect some amount of minimal packaging that would allow them to install your application or library using Python packaging tools like pip or Poetry.
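The distinction between a module and a package can be demonstrated with a short, self-contained sketch. The names mymodule and mypackage below are illustrative only:

```python
import importlib
import os
import sys
import tempfile

root = tempfile.mkdtemp()

# a module: a single .py source file
with open(os.path.join(root, "mymodule.py"), "w") as f:
    f.write("ANSWER = 42\n")

# a package: a directory containing an __init__.py file
os.makedirs(os.path.join(root, "mypackage"))
with open(os.path.join(root, "mypackage", "__init__.py"), "w") as f:
    f.write("GREETING = 'hello'\n")

# both become importable once their parent directory is on sys.path
sys.path.insert(0, root)
mymodule = importlib.import_module("mymodule")
mypackage = importlib.import_module("mypackage")
print(mymodule.ANSWER, mypackage.GREETING)  # 42 hello
```

Packaging tools exist precisely so that users never have to manipulate sys.path by hand like this; installing a package places it on the import path automatically.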

There are several possible layouts of the source tree for a Python package that is to be distributed on PyPI. There are a few recurring patterns, and almost every package shares a few common files. It's hard to say which layout is best, so let's simply consider the following layout, which is the authors' favorite:

.
├── packagename/
│   └── __init__.py
├── tests/
│   ├── __init__.py
│   └── conftest.py
├── bin/
├── data/
├── docs/
├── README.md 
├── LICENSE
├── setup.py
├── setup.cfg
├── MANIFEST.in
└── CHANGELOG.md

The main structure of the project sources is dictated by the sub-directory layout. Each one has its own role:

  • packagename/: This is the directory holding the Python sources of the package. This is the core of what is distributed on PyPI. Preferably, this has exactly the same name as the name under which the package is registered on PyPI, although many developers use dashes instead of underscores in the PyPI registration. Usually, there's only one top-level package in the source tree.
  • tests/: This is the test package directory. It holds test modules and (optionally) test sub-packages. In the example above, we see the conftest module, which is a special test module of the pytest framework that usually contains test fixtures and optional pytest plugins. This directory usually isn't distributed on PyPI because the tests name is pretty common, and your test package would likely conflict with other test packages in the site-packages directory after the installation. If you want to distribute tests with your package, you should namespace it by nesting it within the main package directory (here, the packagename/ directory).

    Some developers prefer to put a package sources directory and test package directory inside of an additional top-level src/ directory. This doesn't change a lot and is rather a matter of personal preference.

  • bin/: This is a directory for shell scripts and utilities that may be helpful in package development. It can hold, for instance, scripts for building documentation, custom linters, or utilities aiding in the package distribution process. These scripts are not distributed on PyPI.

    If a package has to distribute some actual shell scripts, the common convention is to put them in the scripts/ directory.

  • data/: This is a directory for essential data files that have to be included in package distribution. An example could be pre-trained machine learning models, images, or translation files.
  • docs/: This is a directory for package documentation. Documentation can take any form, but many developers use automated documentation building systems like Sphinx or MkDocs. In such cases, the docs/ directory holds documentation sources and configuration for those systems but not the rendered documentation files. This directory often isn't distributed on PyPI.

    Sphinx is a documentation generator that is used to build official Python documentation. You can learn more about Sphinx at https://www.sphinx-doc.org.

    Sphinx is powerful but quite heavyweight. Sometimes (especially for smaller packages) a more lightweight tool can be a better alternative. MkDocs is a popular static site generator that is specifically designed for building project documentation. You can learn more about MkDocs at https://www.mkdocs.org.

Files outside of the above directories usually provide configuration tools or hold metadata of the package. The suggested layout lists six files that are the essential minimum for an open-source package:

  • README.md: This file contains a minimal description and/or documentation of the package. The .md extension denotes the Markdown markup language, which is a popular choice with developers. The use of dedicated markup language is fully optional and common alternative names for this file are README or README.txt. It is a good practice to include this file in package distribution.

    Another popular markup choice for documenting a Python project is reStructuredText (denoted by the .rst file extension). It is the default markup language of the Sphinx engine. You can read more about reStructuredText at https://docutils.sourceforge.io/rst.html.

  • LICENSE: This file contains a software license for package users. It is usually a plain-text file without any specific markup language. Package distribution should include this file.
  • setup.py: This is a Python package distribution script. It is used to build package distributions and upload them to the package registry. Among other things, it contains package metadata and definitions of extensions (if the package provides any). It is included only in source distributions (we will discuss them in the Types of package distributions section).
  • setup.cfg: This is an optional Python package configuration file (INI-style). It may include package metadata and default options for setup.py script subcommands. Many Python development tools (test frameworks, linters) use dedicated sections in this file as their own configuration too.
  • MANIFEST.in: This is the template file for the package file manifest. It can be used to tell the setup.py script which of the non-source files should be included in the package distribution.
  • CHANGELOG.md: This is an optional file with a log of all changes made to the package up to the current release. It is a good practice to include it in the package distribution. Short changelogs can also be included in the README file, although for projects with frequent releases, it is usually better to have a dedicated file for that purpose.

Many developers choose to maintain a log of changes in a more convenient form outside of the source tree. A popular example is the project's Releases section on GitHub. Still, it is a good practice to include at least a minimal log of changes with package distribution as well.

Some of those files have a very specific syntax or structure, which we will discuss shortly. Let's take a closer look at the most important one—the setup.py script.

setup.py

The root directory of a project that has a distributable Python package contains a setup.py script. It provides essential package metadata like the version number, description, authors, license type, or required dependencies. Package metadata is expressed as arguments to the setuptools.setup() function.

Python provides the built-in distutils module for the purpose of code packaging, but it is recommended to use setuptools instead. The setuptools package provides multiple enhancements over the standard distutils module. Also, starting from Python 3.10, the distutils module is officially deprecated, and the setuptools codebase is now independent of it. That's why we will be discussing the behavior of the setuptools package in this chapter.

Therefore, the minimum content for the setup.py file is as follows:

from setuptools import setup 
 
setup( 
    name='mypackage', 
)

Note that using a bare name argument is just enough to register the package in the package registry but it still does not allow you to create functional distributions. In order to create functional distributions, you will have to provide a little more metadata that will allow the setuptools package to properly collect source files. We will discuss the most important metadata entries later, in the Essential package metadata section.

The name argument defines the full name of the package distribution. If you decide to publish your package in a registry like PyPI, it will be registered under this exact name. From there, the script provides several commands that can be listed with the --help-commands option. The following is an example output:

$ python3 setup.py --help-commands
Standard commands:
  build             build everything needed to install
  clean             clean up temporary files from 'build' command
  install           install everything from build directory
  sdist             create a source distribution (tarball, zip file, etc.)
  register          register the distribution with the Python package index
  bdist             create a built (binary) distribution
  check             perform some checks on the package
  upload            upload binary package to PyPI
Extra commands:
  bdist_wheel       create a wheel distribution
  alias             define a shortcut to invoke one or more commands
  develop           install package in 'development mode'
usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
   or: setup.py --help [cmd1 cmd2 ...]
   or: setup.py --help-commands
   or: setup.py cmd --help

The actual list of commands is longer and can vary depending on the available setuptools extensions. It was truncated to show only those that are most important and relevant to this chapter.

Standard commands are the built-in commands provided by distutils, whereas extra commands are the ones provided by third-party packages, such as setuptools or any other package that defines and registers a new command. Here, one such extra command registered by another package is bdist_wheel, provided by the wheel package.

setup.cfg

The setup.cfg file contains default options for commands of the setup.py script. This is very useful if the process for building and distributing the package is more complex and requires many optional arguments to be passed to the setup.py script commands. The setup.cfg file allows you to store such default parameters together with your source code on a per-project basis. This binds your distribution flow to the project and also provides transparency to users and team members about how your package was built and distributed.

The syntax for the setup.cfg file is the same as provided by the built-in configparser module so it is similar to the popular Microsoft Windows INI files. Here is an example of the setup.cfg configuration file that provides some global defaults as well as defaults for sdist and bdist_wheel commands:

[global]
quiet=1 
 
[sdist]
formats=tar,zip
[bdist_wheel]
universal=1

The above configuration will ensure that source distributions (the sdist section) will always be created in two formats (ZIP and TAR) and the built wheel distributions (the bdist_wheel section) will be created as universal wheels that are independent of the Python version. Also, most of the output will be suppressed on every command by the global --quiet switch.

Note that the global quiet option is included here only for demonstration purposes and it may not be a sensible choice to suppress the output for every command by default. You can also provide a global personal configuration file named .pydistutils.cfg in your home directory.
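Because setup.cfg uses the INI dialect understood by the standard configparser module, its contents can be inspected programmatically. A quick sketch, using the same options as the example above:

```python
import configparser

# the same setup.cfg contents as in the example above
SETUP_CFG = """
[global]
quiet = 1

[sdist]
formats = tar,zip

[bdist_wheel]
universal = 1
"""

config = configparser.ConfigParser()
config.read_string(SETUP_CFG)

# values are read back as plain strings or coerced with get* helpers
print(config.get("sdist", "formats"))                # tar,zip
print(config.getboolean("bdist_wheel", "universal"))  # True
```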

MANIFEST.in

When building a source distribution with the sdist command, the setuptools module browses the package directory looking for files to include in the archive. By default, setuptools will include the following files based on arguments of the setup() function:

  • All Python source files implied by the py_modules and packages arguments
  • All extension source files listed in the ext_modules argument
  • All scripts specified by the scripts argument
  • All files specified by the package_data and data_files arguments
  • The license files specified by the license_file and license_files arguments
  • Files that match the glob pattern test/test*.py
  • Files named setup.py, pyproject.toml, setup.cfg, and MANIFEST.in
  • Files named README, README.txt, README.rst, and README.md

Besides that, if your package is versioned with a version control system such as Subversion, Mercurial, or Git, there is the possibility to auto-include all version-controlled files using additional setuptools extensions such as setuptools-svn, setuptools-hg, and setuptools-git. Integration with other version control systems is also possible through other custom extensions.

No matter if it uses the default built-in file collection strategy or one defined by a custom extension, sdist will create a MANIFEST file that lists all files and will include it in the final archive.

Although the setup() function arguments allow you to list any type of file to be included in the package distribution, listing them one by one may not be the most convenient option. Also, using the extensions for a specific version control system may capture some files that you may not want to include in your package distribution. In both cases, you can use the MANIFEST.in template to provide an extra manifest template to automatically include or exclude files based on the file name pattern.

Let's say you are not using any extra extensions, and you need to include in your package distribution some files that are not captured by default. You can define a template called MANIFEST.in in your package root directory (the same directory as the setup.py file). This template directs the sdist command on which files to include.

The MANIFEST.in template defines one inclusion or exclusion rule per line. The following is an example of the MANIFEST.in template that enables the inclusion of the LICENSE file, extra textual information found in .txt files, and all Markdown-formatted files:

include HISTORY.txt 
include README.txt 
include CHANGES.txt 
include CONTRIBUTORS.txt 
include LICENSE 
global-include *.md

The full list of the MANIFEST.in commands can be found in the Python Packaging User Guide at https://packaging.python.org/guides/using-manifest-in/#manifest-in-commands.
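The inclusion rules rely on glob-style patterns. The following sketch only approximates how an `include *.txt` rule (plus an explicit `include LICENSE`) selects files from a source tree; the real implementation in setuptools/distutils differs in details, and the file names are illustrative:

```python
from fnmatch import fnmatch

# a hypothetical flat listing of files in the source tree
files = ["README.txt", "LICENSE", "CHANGES.txt", "docs/index.md", "src/mod.py"]

# approximate "include *.txt" and "include LICENSE" rules
included = [name for name in files if fnmatch(name, "*.txt") or name == "LICENSE"]
print(included)  # ['README.txt', 'LICENSE', 'CHANGES.txt']
```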

Essential package metadata

The most important argument of the setup() function is name. Without it, the setuptools package will assume the UNKNOWN name, which won't allow you to easily distinguish different package distributions.

Using just the name argument is of course not enough to provide proper and functional packaging for your code. The most important arguments that the setup() function can receive are as follows:

  • version: This is the current version specifier of the package.
  • description: This includes a short description of the package. It is usually one sentence that explains the purpose of the package.
  • long_description: This includes a full description that can be in reStructuredText (default) or other supported markup languages.
  • long_description_content_type: This defines the MIME type of the long description; it is used to tell the package repository what kind of markup language is used for the package description.
  • keywords: This is a list of keywords that define the package and allow for better indexing in the package repository.
  • author: This is the name of the package author or organization that takes care of it.
  • author_email: This is the contact email address of the package author.
  • install_requires: This lists the packages and their versions that are required dependencies of your package. For instance, if your package requires some other packages available on PyPI in order to work, you put their names (and their version requirements) here.
  • url: This is the project URL. It is often the URL to the site where project sources and/or documentation are hosted.
  • license: This is the name of the license (GPL, LGPL, and so on) under which the package is distributed.
  • py_modules: A list of Python modules to include in the distribution. It can be used for simple projects that have only top-level modules that do not share a common package namespace.
  • packages: This is a list of all package names in the package distribution; the setuptools package provides a helpful function called find_packages() that can automatically find package names to include.
  • namespace_packages: This is a list of namespace packages within a package distribution.

The above arguments are essential metadata entries that will allow you to properly build package distributions but also attribute your code to you. Pay attention to license information and all addresses (email and URLs) that will allow users to gain more information about your package and terms of use or to reach you for help.
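As a sketch of how find_packages() works, the following builds a small throwaway source tree (the layout and names are illustrative) and lets setuptools discover the packages in it, excluding the tests package as discussed earlier in this chapter:

```python
import os
import tempfile

from setuptools import find_packages

# build a tiny illustrative source tree in a temporary directory
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "mypackage", "subpackage"))
open(os.path.join(root, "mypackage", "__init__.py"), "w").close()
open(os.path.join(root, "mypackage", "subpackage", "__init__.py"), "w").close()
os.makedirs(os.path.join(root, "tests"))
open(os.path.join(root, "tests", "__init__.py"), "w").close()

# find_packages() scans for directories marked with __init__.py files
packages = find_packages(root, exclude=["tests", "tests.*"])
print(sorted(packages))  # ['mypackage', 'mypackage.subpackage']
```

The result of find_packages() can be passed directly as the packages argument of setup().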

The setuptools package provides a few more metadata entries that we didn't list here. All package metadata entries are described in detail in the PEP 345 document available at https://www.python.org/dev/peps/pep-0345/.

One of the important but not essential arguments is classifiers. It allows you to categorize your application using a standardized set of software categories known as trove classifiers. This feature is especially useful if you want to publish your application on PyPI. Let's take a closer look at it.

Trove classifiers

PyPI provides a solution for categorizing applications with the set of classifiers called trove classifiers. All trove classifiers form a tree-like structure. Each classifier string defines a list of nested namespaces where every namespace is separated by the :: substring. Their list is provided to the package definition as a classifiers argument of the setup() function.

Here is an example list of classifiers taken from the solrq project available on PyPI:

from setuptools import setup 
 
setup( 
    name="solrq", 
    # (...) 
 
    classifiers=[ 
        'Development Status :: 4 - Beta', 
        'Intended Audience :: Developers', 
        'License :: OSI Approved :: BSD License', 
        'Operating System :: OS Independent', 
        'Programming Language :: Python', 
        'Programming Language :: Python :: 2', 
        'Programming Language :: Python :: 2.6', 
        'Programming Language :: Python :: 2.7', 
        'Programming Language :: Python :: 3', 
        'Programming Language :: Python :: 3.2', 
        'Programming Language :: Python :: 3.3', 
        'Programming Language :: Python :: 3.4', 
        'Programming Language :: Python :: Implementation :: PyPy', 
        'Topic :: Internet :: WWW/HTTP :: Indexing/Search', 
    ], 
) 

Trove classifiers are completely optional in the package definition but provide a useful extension to the basic metadata available in the setup() interface. Among others, trove classifiers may provide information about supported Python versions, supported operating systems, the development stage of the project, or the license under which the code is released. Many PyPI users search and browse the available packages by categories so a proper classification helps packages to reach their target.
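Because each classifier is simply a path of namespaces separated by the :: substring, processing classifiers programmatically is straightforward. A minimal example:

```python
# split a trove classifier into its nested namespace components
classifier = "License :: OSI Approved :: BSD License"
parts = [part.strip() for part in classifier.split("::")]
print(parts)  # ['License', 'OSI Approved', 'BSD License']
```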

Trove classifiers play an important role in the whole packaging ecosystem and should never be ignored. There is no organization that verifies package classification, so it is your responsibility to provide proper classifiers for your packages and not introduce chaos to the whole package index.

At the time of writing this book, there are 756 classifiers available on PyPI that are grouped into the following major categories:

  • Development status
  • Environment
  • Framework
  • Intended audience
  • License
  • Natural language
  • Operating system
  • Programming language
  • Topic
  • Typing

This list is ever-growing, and new classifiers are added from time to time. It is thus possible that the total count of them will be different at the time you read this. The full list of currently available trove classifiers is available at https://pypi.org/classifiers/ and can be accessed in Python code via the trove-classifiers package available at https://github.com/pypa/trove-classifiers.

We know what the typical anatomy of a Python package is. Now it's time to discuss various types of package distributions supported by standard Python packaging tools.

Types of package distributions

Package distribution is a packaging artifact that wraps Python package sources, metadata, and any additional files into a single-file archive that can be distributed to other developers either in raw form or through the package repository.

There are generally two types of distributions for Python packages:

  • Source distributions
  • Built (binary) distributions

Source distributions are the simplest and most platform-independent. For pure Python packages, it is a no-brainer. Such a distribution contains only Python sources, and these should already be highly portable.

A more complex situation is when your package introduces some extensions written, for example, in C. Source distributions will still work provided that the package user has the proper development toolchain in their environment. This consists mostly of the compiler and proper C header files. For such cases, the built distribution format may be better suited because it can provide already built extensions for specific platforms.

Creating source distributions is handled by the sdist command of the setup.py script. That's why they are also commonly referred to as sdist distributions. They are the easiest to create so let's take a look at them first.

sdist distributions

The sdist command is the simplest of the setup.py script distribution commands. It creates a release tree and copies everything that is needed to run the package to it. This tree is then archived in one or many archive files (often, it just creates one tarball). The archive is basically a copy of the source tree.

This command is the easiest way to distribute a package that would be independent of the target system. It creates a dist/ directory for storing the archives to be distributed. Before you create the first distribution, you have to provide a setup() call with a version number. If you don't, the setuptools module will assume the default value of 0.0.0.

To see how it works in action, let's consider the following example of the setup.py script:

from setuptools import setup 
 
setup(name='acme.sql', version='0.1.1') 

Let's now run the sdist command for the acme.sql package in the 0.1.1 version:

$ python setup.py sdist

You should see the following output:

running sdist
...
creating dist
tar -cf dist/acme.sql-0.1.1.tar acme.sql-0.1.1
gzip -f9 dist/acme.sql-0.1.1.tar
removing 'acme.sql-0.1.1' (and everything under it)

If we now list the contents of the dist/ directory, we should see the following output:

$ ls dist/
acme.sql-0.1.1.tar.gz

On Windows, the default archive type will be ZIP.

The version specifier is used in the name of the archive. Now the archive can be distributed and installed on any system that has Python. In the sdist distribution, if the package contains C libraries or extensions, the target system is responsible for compiling them. This is very common for Linux-based systems or macOS because they commonly provide a compiler. But it is less usual to have it working out of the box under Windows.
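Conceptually, the sdist archive is just a gzipped tarball whose top-level directory is named after the package and its version. The following sketch builds and inspects a tiny archive in memory to illustrate that structure (the file contents are illustrative):

```python
import io
import tarfile

buffer = io.BytesIO()
source = b"from setuptools import setup\nsetup(name='acme.sql', version='0.1.1')\n"

# an sdist is essentially a gzipped tarball of the release tree,
# rooted at a "{name}-{version}/" directory
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    info = tarfile.TarInfo("acme.sql-0.1.1/setup.py")
    info.size = len(source)
    archive.addfile(info, io.BytesIO(source))

# reading it back shows the versioned top-level directory
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:gz") as archive:
    members = archive.getnames()
print(members)  # ['acme.sql-0.1.1/setup.py']
```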

If a package with extensions is intended to be used on several platforms, it should always be distributed with a prebuilt distribution format as well.

Prebuilt distributions are created with a different set of setup.py script commands. Let's take a look at them.

bdist and wheel distributions

To be able to distribute a prebuilt distribution, setuptools provides the build command. This command compiles the package in the following four steps:

  • build_py: This builds pure Python modules by byte-compiling them and copying them into the build folder.
  • build_clib: This builds C libraries, when the package contains any, using the system C compiler and creating a static library in the build folder.
  • build_ext: This builds C extensions and puts the result in the build folder like build_clib.
  • build_scripts: This builds the modules that are marked as scripts. It also changes the interpreter path when the first line was set (using the #! shebang prefix) and fixes the file mode so that it is executable.

Each of these steps is a command that can also be invoked independently. The result of the compilation process is a build/ folder that contains everything needed for the package to be installed. There's no cross-compilation option in the setuptools package. This means that the result of the command is always specific to the system it was built on.

When some C extensions have to be created, the build process uses the default system compiler and the Python header file (Python.h). If Python was installed from your operating system's package repository, an extra system package is probably required to build extensions. At least in popular Linux distributions, it is often named python-dev or python3-dev. It contains all the necessary header files for building Python extensions.

The C compiler used in the build process is the compiler that is the default for your operating system. For a Linux-based system or macOS, this would be gcc or clang respectively. For Windows, Microsoft Visual C++ can be used (there's a free command-line version available). The open-source project MinGW can be used as well. The compiler choice can also be configured through setuptools.

The build command is used by the bdist command to build a binary distribution. It invokes build and all the dependent commands and then creates an archive in the same way as sdist does.

Let's create a binary distribution for acme.sql as follows:

$ python setup.py bdist

If run on macOS, the output could be as follows:

running bdist
running bdist_dumb
running build
...
running install_scripts
tar -cf dist/acme.sql-0.1.1.macosx-10.3-fat.tar .
gzip -f9 acme.sql-0.1.1.macosx-10.3-fat.tar
removing 'build/bdist.macosx-10.3-fat/dumb' (and everything under it)

If we now list the contents of the dist/ directory, we should see the following output:

$ ls dist/
acme.sql-0.1.1.macosx-10.3-fat.tar.gz    acme.sql-0.1.1.tar.gz 

Notice that the newly created archive's name contains the name of the system and the distribution it was built on (macOS 10.3). The same command invoked on Windows will create a different system-specific distribution archive:

C:\acme.sql> python.exe setup.py bdist
...
C:\acme.sql> dir dist
25/02/2008  08:18    <DIR>          .
25/02/2008  08:18    <DIR>          ..
25/02/2008  08:24            16 055 acme.sql-0.1.1.win32.zip
               1 File(s)         16 055 bytes
               2 Dir(s)  ... bytes free

If a package contains C code, apart from a source distribution, it's important to release as many different binary distributions as possible. At the very least, a Windows binary distribution is important for those who most probably don't have a C compiler installed.

A binary release contains all resources required to use the package on the intended system. It mainly contains a folder that is copied into Python's site-packages folder. It may also contain cached bytecode files (the __pycache__/*.pyc files).

The other kind of built distribution is the wheel, provided by the wheel package. When installed (for example, using pip), the wheel package adds a new bdist_wheel command to the setup.py script. It allows the creation of platform-specific distributions (currently only for Windows, macOS, and Linux) that are a better alternative to normal bdist distributions. Wheels were designed to replace another distribution format introduced earlier by setuptools, called eggs. Eggs are now obsolete, so they won't be featured in this book. The list of advantages of using wheels is quite long. Here are the ones that are mentioned on the Python Wheels page available at http://pythonwheels.com/:

  • Faster installation for pure Python and native C extension packages.
  • Avoids arbitrary code execution for installation (avoids setup.py).
  • Installation of a C extension does not require a compiler on Windows, macOS, or Linux.
  • Allows better caching for testing and continuous integration.
  • Creates .pyc files as part of the installation to ensure they match the Python interpreter used.
  • More consistent installs across platforms and machines.

According to Python Packaging Authority (PyPA) recommendations, wheels should be your default distribution format. For a very long time, binary wheels for Linux were not supported, but fortunately that has changed. Binary wheels for Linux are called manylinux wheels.

PyPA is a community formed to bring back order and organization to the Python packaging ecosystem. The Python Packaging User Guide (https://packaging.python.org), maintained by PyPA, is the authoritative source of information about the latest packaging tools and best practices.

The process of building manylinux wheels is unfortunately not as straightforward as for Windows and macOS binary wheels. For this kind of wheel, PyPA maintains special Docker images that serve as a ready-to-use build environment. You can find sources of these images and detailed information on how to use them on the project's GitHub page available at https://github.com/pypa/manylinux.

Registering and publishing packages

Packages would be useless without an organized way to store, upload, and download them. The Python Package Index (PyPI) is the main source of open-source packages in the Python community. Anyone can freely upload new packages and the only requirement is to register on the PyPI site at https://pypi.org.

Packages are bound to the user, so, by default, only the user that registered the name of the package is its admin and can upload new distributions. This could be a problem for bigger projects, so there is an option to mark other users as package maintainers so that they are able to upload new distributions too.

You are not, of course, limited to only this index and all Python packaging tools support the usage of alternative package repositories. This is especially useful for distributing closed-source code among internal organizations or for deployment purposes. Here we focus mainly on open-source uploads to PyPI, with only a brief mention of how to specify alternative repositories.

The easiest way to upload a package is to use the following upload command of the setup.py script:

$ python setup.py <dist-commands> upload

Here, <dist-commands> is a list of commands that create distributions to upload. Only distributions created during the same setup.py execution will be uploaded to the repository. So, if you want to upload the source distribution, built distribution, and wheel package all at once, then you need to issue the following command:

$ python setup.py sdist bdist bdist_wheel upload

When uploading using setup.py, you cannot reuse distributions that were already built during previous distribution command executions and you are instead forced to rebuild them on every upload. This may be inconvenient for large or complex projects where the creation of the actual distribution may take a considerable amount of time. Notable examples are packages leveraging Python/C API extensions (see Chapter 9, Bridging Python with C and C++).

Another problem with setup.py upload is that it could use plain-text HTTP or unverified HTTPS connections on some older Python versions or if your system is not configured properly. This is why Twine is recommended as a secure replacement for the setup.py upload command.

Twine is a utility for interacting with PyPI that currently serves only one purpose: securely uploading packages to the repository. It supports any packaging format and always ensures that the connection is secure. It also allows you to upload files that were already created, so you are able to test distributions before the release. The following example usage of Twine still requires invoking the setup.py script for building distributions:

$ python setup.py sdist bdist_wheel
$ twine upload dist/*

Twine of course won't guess your credentials and you need to provide them in the special .pypirc file. The .pypirc file is a configuration file that stores information about Python package repositories. It should be located in your home directory. The format for this file is as follows:

[distutils]
index-servers =
    pypi
    other
[pypi]
repository: <repository-url>
username: <username>
password: <password>
[other]
repository: https://example.com/pypi
username: <username>
password: <password>

The distutils section should have the index-servers variable that lists the sections describing the available repositories and the credentials for them. Only the following three variables can be modified in each repository section:

  • repository: This is the URL of the package repository (it defaults to https://pypi.org/).
  • username: This is the username for authentication in the given repository.
  • password: This is the user password for authentication in the given repository (in plain text).

Note that storing your repository password in plain text may not be the wisest security choice. You can always leave it blank. Twine will prompt you for credentials when it needs them.

Another option for the safe handling of your PyPI credentials is to use the keyring package. It will allow Twine to interact with your system keyring service, like Keychain for macOS or Windows Credential Locker. You can read more about this feature at https://twine.readthedocs.io/en/latest/index.html#keyring-support.

The .pypirc file should be respected by every packaging tool built for Python. While this may not be true for every packaging-related utility out there, it is supported by the most important ones, such as twine, distutils, and setuptools.

The danger of using the .pypirc file with Twine is that Twine is by default set to publish packages on PyPI. That may be a problem if you're working with closed-source code and want to publish your package in a private package index. If you forget to use the proper repository argument (the -r flag) and actually have your .pypirc file configured to work with PyPI, you may accidentally make your closed code accessible to the public.

One of the tools that solves multiple problems of packaging Python code is Poetry. It doesn't require custom distribution scripts (the setup.py scripts are replaced with the pyproject.toml configuration file), is fully interactive, and allows you to specify a dedicated package repository together with the source code of your project. Usually, distributing packages with Poetry is as simple as running two commands:

$ poetry build
$ poetry publish

You can learn more about building and publishing packages with Poetry at https://python-poetry.org/docs/cli/#publish.

Package versioning and dependency management

If you have your package published on the package registry, chances are that you will want to modify it at some point and publish a new version of it. In order to allow developers to decide whether they want to use a new release of the package or not, we use version specifiers to tag consecutive releases of the package.

A version specifier generally takes the form of a string composed of numbers separated by dots (like 1.0, 3.6.5, or 4.0.0). That's why version specifiers are also commonly referred to as version numbers. This allows for easy sorting of the version specifiers. By convention, a higher version means a newer release. This convention is assumed by almost every package versioning tool and allows for straightforward updates of outdated packages to their newer version. For instance, with pip you can install a newer package version using the -U switch as in the following example:

$ pip install -U pip
Collecting pip
  Using cached pip-21.0.1-py3-none-any.whl (1.5 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 20.2.4
    Uninstalling pip-20.2.4:
      Successfully uninstalled pip-20.2.4
Successfully installed pip-21.0.1

In the above example, we've used pip to update itself (it is distributed as a package). The output shows that the currently installed pip version was 20.2.4. At the time of running this command, the most recent pip version on PyPI was 21.0.1. pip compared those two version specifiers and decided that the one available on PyPI is a higher version number. It uninstalled the old version and installed the new one in the current environment.
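The dotted-number comparison performed here can be sketched in pure Python. The following is a naive sketch that handles only purely numeric specifiers; real installers implement the full PEP 440 rules (for example, via the third-party packaging library):

```python
# Naive sketch: compare dotted version numbers as tuples of integers.
# Real tools implement the complete PEP 440 comparison rules.
def version_key(specifier):
    return tuple(int(part) for part in specifier.split("."))

versions = ["20.2.4", "20.10.0", "21.0.1"]
print(max(versions, key=version_key))  # numeric comparison yields 21.0.1

# Plain string comparison is not enough because it compares character
# by character, so "1.9.0" would wrongly beat "1.10.0":
print(max(["1.10.0", "1.9.0"]))                   # 1.9.0 (wrong)
print(max(["1.10.0", "1.9.0"], key=version_key))  # 1.10.0 (correct)
```

This tuple-based comparison is exactly why a higher version number can be reliably treated as a newer release.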

Although package versions are usually composed only of numbers, Python allows you to use letters in version specifiers. This allows you, for instance, to tag specific versions as pre-releases, development releases, or post-releases. Those extra version specifier components are usually included as the last version specifier segment just after the numeric segments.

The PEP 440 document (Version Identification and Dependency Specification) is the official standard for versioning packages that, among other things, specifies the following conventions for those special release tags:

  • {a|b|rc}N: Designates a pre-release version (alpha, beta, or release candidate). These tags designate versions at various stages of development. Alpha releases are the earliest stage and release candidates are close to being the final versions. A package can have multiple versions in any pre-release stage and they are distinguished by raising the N number. An example progression of pre-release versions could be: 1.0.0a1, 1.0.0a2, 1.0.0b1, and 1.0.0rc1. Versions without pre-release tags are considered final and always take precedence over pre-releases with the same number prefix.

    pip does not install pre-releases and development versions by default. If you want to install a pre-release version, you need to use the --pre option of the pip install command.

  • postN: Designates a post-release version. Post-releases are often used to release an update that does not constitute a functional fix or enhancement. Examples could be updates to package metadata or documentation (if it is included in the package distribution). The same version number can have multiple post-releases and they are distinguished by raising the N number. Post-releases can also be added on top of pre-releases. Example post-release version specifiers could be 1.0.0.post1, 1.0.0a1.post1, and 1.0.0a1.post2.
  • devN: Designates a developmental release. Some package maintainers choose to publish packages as part of continuous integration systems and those developmental versions can be used to distinguish consecutive builds of the package. The same version number can have multiple developmental releases distinguished by raising the N number. Developmental releases can also be added on top of pre-releases and post-releases, although this practice is strongly discouraged on general-purpose public package indexes.

You can access the full version of the PEP 440 document at https://www.python.org/dev/peps/pep-0440/.

Pre-releases, post-releases, and developmental releases add some complexity to package versioning and thus are not used by many package maintainers. Still, pre-releases at least can be a useful tool for giving developers the ability to preview and evaluate a future release of a package in their own environments.

What matters most is the final version number of the package. There are two popular versioning strategies to decide what number to assign to the new package release:

  • Semantic versioning: This strategy assumes that each numeric component has a semantic value that allows package consumers to infer the amount and scope of changes between two versions.
  • Calendar versioning: This strategy assumes that selected numeric components are derived from the date on which the new release was crafted (or was supposed to be crafted). This allows users to infer the amount of development time that has passed between two versions.

To make things easier, the community has come up with two standards for those versioning strategies to ease their adoption. Let's take a closer look at them.

The SemVer standard for semantic versioning

The SemVer standard assumes that a version specifier consists of, at most, three numerical segments:

  • The MAJOR segment: Changing the MAJOR segment is a sign of a backward-incompatible change. Users updating between two major versions should expect that their code may no longer be working properly.
  • The MINOR segment: Changing the MINOR segment is a sign of new backward-compatible feature upgrades. Users updating between two minor versions (within the same major version) should not expect their code to become invalid but may receive new functional enhancements.
  • The PATCH segment: Changing the PATCH segment is a sign of bug fixes. Users updating between two patch versions (within the same major and minor versions) should expect some issues to be fixed but should not expect any other enhancements or new features.

A proper SemVer version always includes all three segments in the following order:

MAJOR.MINOR.PATCH

For instance, version 20.2.4 of a package denotes the 20th major version, the second minor release within that major version, and its fourth patch release. According to SemVer versioning principles, users updating from version 20.2.0 or 20.1.0 should not expect any breaking changes.
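These rules can be illustrated with a short sketch. The semver_change() helper below is hypothetical (it is not part of any packaging tool) and assumes well-formed three-segment versions:

```python
# Hypothetical helper: classify the change between two SemVer versions.
def semver_change(old, new):
    old_segments = tuple(int(part) for part in old.split("."))
    new_segments = tuple(int(part) for part in new.split("."))
    if new_segments[0] != old_segments[0]:
        return "major: potentially breaking changes"
    if new_segments[1] != old_segments[1]:
        return "minor: backward-compatible features"
    if new_segments[2] != old_segments[2]:
        return "patch: bug fixes only"
    return "no change"

print(semver_change("20.1.0", "20.2.4"))  # minor: backward-compatible features
print(semver_change("20.2.4", "21.0.0"))  # major: potentially breaking changes
```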

The full specification also covers usage of pre-release versions and build numbers and provides guidance on communicating API changes and handling feature deprecation policies. You can access the full specification text at https://semver.org.

CalVer for calendar versioning

CalVer is more of a versioning blueprint than a full-fledged standard (especially when compared to SemVer). It assumes that a version specifier is composed of segments corresponding to elements of the date associated with a particular release.

The site explaining the CalVer convention lists the following common date-based segments:

  • YYYY: Full year: 2006, 2016, 2106
  • YY: Short year: 6, 16, 106
  • 0Y: Zero-padded year: 06, 16, 106
  • MM: Short month: 1, 2 ... 11, 12
  • 0M: Zero-padded month: 01, 02 ... 11, 12
  • WW: Short week (since start of year): 1, 2, 33, 52
  • 0W: Zero-padded week: 01, 02, 33, 52
  • DD: Short day: 1, 2 ... 30, 31
  • 0D: Zero-padded day: 01, 02 ... 30, 31

All CalVer segments are based on the Gregorian calendar.

This convention is best suited for projects that have a well-defined release schedule or are somehow time-sensitive. Example time-sensitive projects are certifi (a bundle of Mozilla-curated lists of trusted root certificates that changes regularly) and tzdata (a bundle of IANA time zone databases; see Chapter 3, New Things in Python).

There's no common format of CalVer versions and users of CalVer have to decide on their own which version segments to use. The deciding factor is usually the release cadence of the project. This convention can also be mixed to some extent with semantic versioning. The pip project, for instance, used a versioning scheme composed of YY.MINOR.PATCH segments.
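Deriving such a mixed specifier from a release date can be as simple as the following sketch (the calver() helper is hypothetical):

```python
from datetime import date

# Hypothetical helper: compose a YY.MINOR.PATCH-style specifier where
# the first segment is derived from the short release year.
def calver(release_date, minor=0, patch=0):
    return f"{release_date.year % 100}.{minor}.{patch}"

print(calver(date(2021, 1, 30), minor=0, patch=1))  # 21.0.1
```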

The official site for the CalVer convention isn't as thorough as the SemVer specification but provides some interesting case studies and guidelines for calendar versioning. You can find it at https://calver.org.

Installing your own packages

Working with setuptools is mostly about building and distributing packages, but you will still need it to install packages directly from project sources. The reason for that is simple: it is a good habit to test whether your packaging code works properly before submitting your package to PyPI, and the simplest way to test it is by installing it. If you send a broken package to the repository, then in order to re-upload it, you need to increase the version number.

Testing if your code is packaged properly before the final distribution saves you from unnecessary version number inflation and obviously from wasting your time.

Installing packages directly from sources

Installation directly from your own sources using setuptools may be essential when working on multiple related packages at the same time:

python setup.py install

The install command installs the package in your current Python environment. It will try to build the package if no previous build was made and then inject the result into the filesystem directory where Python is looking for installed packages. If you have an archive with a source distribution of some package, you can decompress it in a temporary folder and then install it with this command. The install command will also install dependencies that are defined in the install_requires argument. Dependencies will be installed from PyPI.

When installing a package, an alternative to the setup.py script is to use pip. Since it is a tool that is recommended by PyPA, you should use it even when installing a package in your local environment just for development purposes. In order to install a package from local sources, run the following command:

pip install <project-path>

If you want to install a package from the distribution archive, this command becomes:

pip install <path-to-archive>

Amazingly, the setup.py script lacks the uninstall command. Fortunately, it is possible to uninstall any Python package using pip as follows:

pip uninstall <package-name>

Uninstalling can be a dangerous operation when attempted on system-wide packages. This is another reason why it is so important to use virtual environments for any development.

Installing packages through the setup.py script or the pip install command copies the sources of the package (or contents of the distribution) to your site-packages directory. But sometimes we want to make package sources available in a specific environment without copying them. This method of installation is called editable-mode installation and is especially useful when working on multiple related packages that have independent source trees.

Installing packages in editable mode

Packages installed with setup.py install are copied to the site-packages directory of your current Python environment. This means that whenever you make a change to the sources of that package, you are required to reinstall it. This is often a problem during intensive development because it is very easy to forget about the need to perform the installation again.

This is why setuptools provides an extra develop command that allows you to install packages in development mode. This command creates a special link to the project sources in the deployment directory (site-packages) instead of copying the whole package there. Package sources can be edited without the need for reinstallation and are available in sys.path as if they were installed normally.

pip also allows you to install packages in such a mode. This installation option is called editable mode and can be enabled with the -e parameter in the install command as follows:

pip install -e <project-path>

Once you install the package in your environment in editable mode, you can freely modify the installed package in place and all the changes will be immediately visible without the need to reinstall the package.

Using editable mode helps when you need to work with multiple related packages without the need to reinstall them continuously. Another practice that is helpful in projects composed of multiple related packages is using namespace packages.

Namespace packages

The Zen of Python says the following about namespaces:

Namespaces are one honking great idea – let's do more of those!

And this can be understood in at least two ways. The first is a namespace in the context of the language. We all use the following namespaces without even knowing it:

  • The global namespace of a module
  • The local namespace of the function or method invocation
  • The class namespace

The other kind of namespaces can be provided at the packaging level. These are namespace packages. This is often an overlooked feature of Python packaging that can be very useful in structuring the package ecosystem in your organization or in a very large project.

Namespace packages can be understood as a way of grouping related packages, where each of these packages can be installed independently.

Namespace packages are especially useful if you have components of your application developed, packaged, and versioned independently but you still want to access them from the same namespace. This also helps to make it clear which organization or project every package belongs to. For instance, for some imaginary Acme company, the common namespace could be acme. Therefore, this organization could create the general acme namespace package that could serve as a container for other packages from this organization. For example, if someone from Acme wants to contribute to this namespace with an SQL-related library, they can create a new acme.sql package that registers itself in the acme namespace.

It is important to understand the difference between normal packages and namespace packages and what problem the latter solve. Normally (without namespace packages), you would create a package called acme with a sql sub-package using the following file structure:

acme/
├── acme
│   ├── __init__.py
│   └── sql
│       └── __init__.py
└── setup.py

Whenever you want to add a new sub-package, let's say templating, you are forced to include it in the source tree of acme as follows:

acme/
├── acme
│   ├── __init__.py
│   ├── sql
│   │   └── __init__.py
│   └── templating
│       └── __init__.py
└── setup.py

Such an approach makes the independent development of acme.sql and acme.templating almost impossible. The setup.py script will also have to specify all dependencies for every sub-package. It is impossible (or at least very hard) to have an optional installation of some of the acme components. Also, with enough sub-packages, it may be hard to avoid dependency conflicts.

With namespace packages, you can store the source tree for each of these sub-packages independently as follows:

acme.sql/
├── acme
│   └── sql
│       └── __init__.py
└── setup.py
acme.templating/
├── acme
│   └── templating
│       └── __init__.py
└── setup.py

And you can also register them independently in PyPI or any package index you use. Users can choose which of the sub-packages they want to install from the acme namespace, but they never install the general acme package (it doesn't even have to exist). Example pip usage would be as follows:

$ pip install acme.sql acme.templating

Note that the setuptools.find_packages() function does not find namespace packages. If you want your setup.py script to collect namespace packages automatically instead of listing them individually, you need to use the setuptools.find_namespace_packages() function instead.

This function will automatically discover namespace packages in directory structures as presented in the previous example.
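For instance, a minimal setup.py script for the acme.templating sub-package could look like the following sketch (the metadata values are illustrative):

```python
from setuptools import setup, find_namespace_packages

setup(
    name="acme.templating",
    version="0.0.1",
    # discovers acme/templating even though the acme/ directory
    # itself has no __init__.py file
    packages=find_namespace_packages(include=["acme.*"]),
)
```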

Packages and namespace packages are concerned mostly with sharing code between projects that run in various environments. If you install such a package in a given environment, it will be immediately available for imports. But that's not the only purpose of Python packaging. Many Python projects provide shell utilities, commands, or even applications with graphical interfaces. A great example is the pip command distributed with the pip package. You can use the Python packaging infrastructure to surface your application scripts and executable modules in the target installation environment the same way the pip package does. Let's see how to do this.

Package scripts and entry points

Every Python module can be executed as if it were a program using the python -m command. This includes standard library modules as well as modules from packages installed by pip. For instance, the following is an invocation of the json.tool module from the standard library that allows you to format JSON text in your shell:

$ echo '{"name": "John Doe", "age": 42}' | python -m json.tool
{
    "name": "John Doe",
    "age": 42
}

That's a simple way to execute any module from an installed package, but it is not the most convenient one. First of all, users of your package would have to know the structure of the modules inside your application and which modules are supposed to be run in the shell. Also, they would have to type the python -m command, which adds a bit of redundancy to their scripts. That's why, when using pip, we'd rather invoke the pip command than python -m pip.

When writing your own Python packages, you can do the same as what the pip package does and provide your own custom shell command that will be installed together with your package. There are two ways to do that:

  • Through the scripts argument of the setuptools.setup() function
  • Through the entry_points argument of the setuptools.setup() function

The scripts argument is the most basic method of providing shell commands through your package. The argument is already supported by the distutils module (the standard library module that setuptools is based on) so it is quite simple. It accepts a list of script file paths that are to be distributed with your package. After package installation, these scripts become available in one of the PATH directories associated with your Python environment.

To see how it works, we will reuse the example of the script that finds imports within Python sources from Chapter 3, New Things in Python. The full code and detailed explanation can be found in that chapter. We will start by creating the findimports.py file with the following contents:

import os
import re
import sys
import_re = re.compile(r"^\s*import\s+\.{0,2}((\w+\.)*(\w+))\s*$")
import_from_re = re.compile(
    r"^\s*from\s+\.{0,2}((\w+\.)*(\w+))\s+import\s+(\w+|\*)+\s*$"
)
def main():
    if len(sys.argv) != 2:
        print(f"usage: {os.path.basename(__file__)} file-name")
        sys.exit(1)
    with open(sys.argv[1]) as file:
        for line in file:
            if match := import_re.match(line):
                print(match.groups()[0])
            if match := import_from_re.match(line):
                print(match.groups()[0])
if __name__ == "__main__":
    main()

From there, we will create the following setup.py script with some basic metadata and the scripts argument:

from setuptools import setup
setup(
    name="findimports",
    version="0.0.0",
    py_modules=["findimports"],
    scripts=["findimports.py"],
)

Now you are able to install the package in editable mode using one of the following commands:

$ pip install -e .
$ python setup.py develop

Or if you prefer, you can install the package in normal mode:

$ pip install .
$ python setup.py install

Once we've installed the package, the findimports module will be available as a shell command. On macOS or Linux, we can use compgen and grep to search through all discoverable commands and see that it is now indeed available in your shell:

$ compgen -c | grep findimports
findimports.py

As you can see, the findimports.py script is now available under exactly the same name as the script file. If you really want to omit the .py extension from the shell command, you have two options:

  • Remove the .py extension from the module file name: You will have to update the setup.py script accordingly. The drawback of this approach is that you will no longer be able to distribute the findimports module as an importable Python module (the py_modules argument). It would also make unit-testing of the script module harder.
  • Create a wrapper script for findimports.py: The scripts argument allows you to distribute any type of script including shell scripts. Here, we could create a wrapper shell script with a name without an extension (for instance, scripts/findimports) and specify it as the target of the scripts argument. The file could be as simple as the following:
    #!/usr/bin/env sh
    python -m findimports "$@"
    

The problems with script file extensions and wrapper scripts in distutils can be avoided thanks to the entry_points extension offered by the setuptools module. It is a standardized way to provide application entry points (like shell scripts) via the configuration in the setup.py distribution script. It allows you to target any function within your package sources to be distributed as a shell script. This greatly simplifies the management of application entry points because you don't need to create dedicated runnable modules.

There are various types of entry points possible but the most common is console_scripts, which allows you to register the module or function as a target of the autogenerated script command. The following is an example of the console entry point we could provide for our findimports script:

from setuptools import setup
setup(
    name="findimports",
    version="0.0.0",
    py_modules=["findimports"],
    entry_points={
        "console_scripts": ["findimports=findimports:main"]
    }
)

The usage of console entry points is more flexible when it comes to naming commands and selecting what exactly runs when a command is invoked. On the left side of the = sign, we have the desired name of the command. In our case, it is simply findimports. On the right side, we have a module import path (findimports again) together with the name of the function (the main() function) to execute.

The entry_points argument allows for better naming of commands as well as packing multiple commands into a single Python module. But it doesn't mean that the scripts argument becomes useless. You can't, for instance, package shell scripts (like Bash) with entry_points but you can do that with the scripts argument.

The feature of entry points in the setuptools package is in fact a generic method of advertising hooks between packages. Every package can query for existing entry points of other packages. This feature can be used, for instance, to create a plugin mechanism. The pytest unit-testing framework is an example package that uses the mechanism of entry points for its plugin system. You can learn more about writing pytest plugins at https://docs.pytest.org/en/stable/writing_plugins.html.
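Entry points of installed distributions can be inspected through the importlib.metadata module in the standard library. A minimal sketch listing console_scripts entry points could look like this (the API shape differs slightly between Python versions):

```python
import importlib.metadata

entry_points = importlib.metadata.entry_points()
# Python 3.10+ returns a selectable EntryPoints collection;
# earlier versions return a mapping of group names to entry lists.
if hasattr(entry_points, "select"):
    console_scripts = list(entry_points.select(group="console_scripts"))
else:
    console_scripts = list(entry_points.get("console_scripts", []))

for entry_point in console_scripts:
    print(f"{entry_point.name} = {entry_point.value}")
```

A plugin system can use the same query against its own entry point group instead of console_scripts.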

Python packaging, thanks to binary wheels and features that allow packaging scripts, can be a method of distributing complete applications. If you use virtual environments, you can ensure a sensible amount of dependency isolation between various applications.

Unfortunately, Python packaging and virtual environments don't solve all environment isolation problems. You cannot, for instance, shield your applications from changes in shared system libraries through virtual Python environments. Also, not every Python dependency you use will be distributed in a binary wheel format. Python extensions written in C, C++, or Cython are amazingly popular, which means that for complex applications, an on-site compilation may often be required. Lack of pure dependency isolation and a common need for on-site compilation are the main reasons why Python packages often aren't reliable distribution artifacts for specific use cases. One such use case is packaging applications and services for the web.

Packaging applications and services for the web

The distribution of software is a process that traditionally requires two parties. Someone (the distributor) has to make the software release available to be consumed. In the past, it required physical media like floppy disks or CDs, but nowadays it is usually done through the internet. Someone else (the consumer) needs to consciously obtain the software and install it on their own computer. It's not always the same for software updates as many applications offer automated updates. Still, these updates usually require the user's consent in order to be installed.

With the advent of Software as a Service (SaaS), less and less software is distributed in a form that would allow it to be installed on the user's own computer. We see that classic programs are gradually being replaced by their SaaS counterparts:

  • Traditional desktop applications are being replaced with web-based software
  • Traditional software libraries are being replaced by web APIs

Web-based software isn't distributed to its users the same way as traditional desktop applications. Users of web-based applications usually interact with them through a standard web browser or a dedicated client that acts as a mere shell for your code that lives on some server or cluster of servers. The code does have to be distributed to those servers, but the whole process is usually opaque to end users, who are rarely even aware of it.

That's why many developers often prefer the term shipping in the context of web-based applications: consumers consciously sign up as users of the software but have very limited control over how and when it will be delivered. Also, potential updates are just shoved through their door, and cannot be easily rejected or discarded.

Web-based applications are increasingly popular. Even applications that are primarily intended for desktop use often provide web-based capabilities like automated updates, cloud synchronization, or online collaboration. It means that it is worth knowing the basics of shipping those web applications even if the web is not your thing.

In this section, we will discuss good practices and tools for building and distributing web applications together with some Python-specific tips and tricks.

The Twelve-Factor App manifesto

Being able to distribute software only to your own servers removes one important factor from the distribution process: users. You don't need to care if they are able to download your application and handle the installation process. You don't have to care about their operating system (although you may need to care about their browser). And also, most of the time, you don't have to ask for permission to perform the update. You can do whatever you want. But should you?

Web-based software comes with a lot of advantages. You have full control of it. You can do as many updates as you want and whenever you want. But that is a double-edged sword. Users of web applications will expect frequent updates and almost immediate fixes for problems they submit. Also, if your software becomes successful, you will have to rely on a large fleet of servers to support the growing scale of your user base. And a large user base is usually the goal of web-based applications.

That's why it is extremely important to build your software in a way that will enable its growth at a sustainable pace. Your application should be easily configurable and decoupled from its dependencies (like external services and the operating system) to ensure easy maintenance and straightforward, repeatable deployments of new versions. It should also be as easy to deploy in a production environment as to run locally for development (and vice versa).

That's of course not easy to do without some operational knowledge. If you don't have much experience working with software at a large scale, you will definitely make a lot of mistakes that will cost you time, resources, and money (server costs, for instance). That's why it is a good idea to follow a set of good, proven practices.

The Twelve-Factor App manifesto is a good set of such practices. It is a general language-agnostic methodology for building SaaS apps. One of its purposes is making applications easier to deploy, but it also highlights other topics such as maintainability or making applications easier to scale.

As the name says, the Twelve-Factor App consists of 12 rules:

  1. Codebase: One codebase tracked in a revision control system and many deploys
  2. Dependencies: Explicitly declare and isolate dependencies
  3. Config: Store configurations in the environment
  4. Backing services: Treat backing services as attached resources
  5. Build, release, run: Strictly separate build and run stages
  6. Processes: Execute the app as one or more stateless processes
  7. Port binding: Export services via port binding
  8. Concurrency: Scale out via the process model
  9. Disposability: Maximize robustness with fast startup and graceful shutdown
  10. Dev/prod parity: Keep development, staging, and production as similar as possible
  11. Logs: Treat logs as event streams
  12. Admin processes: Run administration/management tasks as one-off processes

You can access the full text of the Twelve-Factor App manifesto at https://12factor.net.

We won't discuss every factor in detail, as the Twelve-Factor App website provides a great explanation and rationale for each of them. We will, however, zoom in on the specific rules that can be employed using tools, techniques, or libraries popular in the Python ecosystem.

Leveraging Docker

We've already introduced Docker in Chapter 2, Modern Python Development Environments, as a lightweight virtualization tool that can provide great development environment isolation.

It simply packages all your code and its runtime dependencies (modules, packages, shared libraries) into container images that can be executed as isolated containers in given environments.

Moreover, Docker containers are stateless. This means that two containers started from the same image will have the same initial state. Every filesystem modification done within a container stays inside of the container. Part of the filesystem inside of a container can of course be exported outside by mounting a dedicated volume, but this is always explicit and never happens by accident. A container that has finished its work (its main process exited, either gracefully or due to abrupt termination) is no longer of use, and neither is its internal state.

In fact, Docker containers do not vanish by default after they exit. Automatic removal of exited containers can be enabled with the --rm flag of the docker run command. It is possible to resume working with a container after it has finished, although this should be used only for inspection and not as a default mode of operation.

The way Docker containers and their images are defined, run, and managed already ticks multiple checkboxes of the Twelve-Factor App manifesto:

  • Dependencies: To create a new Docker image, you need to define a Dockerfile, which is a declarative statement of all the preparation steps. This includes all shared libraries, packages, and your own code. Moreover, multi-stage Docker builds allow you to separate build-time dependencies from runtime dependencies. Dependencies are isolated: you can have multiple containers from different Docker images running on the same host system and their dependencies will never conflict.
  • Build, release, run: Docker images are usually built outside of their dedicated runtime environment. It can be a dedicated build server or even your own computer used for development. Images are usually stored in a dedicated image repository. From there, Docker daemons running in target environments can pull the latest image version. Moreover, the tagging of images with descriptive labels allows you to easily keep track of their versions and even designation for a specific environment.
  • Processes: Docker containers are stateless. Moreover, a container looks like a single process from the perspective of the operating system that hosts it. It sandboxes all threads or subprocesses that may be running within a container as well as all resources it may use (memory, for instance).
  • Dev/prod parity: Packaging software into containers allows you to reduce the gap between production and development environments because it isolates a lot of dependencies from the operating system. Also, Docker Compose allows you to compose whole applications from multiple containers and use the same versions of backing services (databases, caches, reverse proxies, and so on) as the ones used in the production environment.

The great thing about Docker is the portability of the applications. As long as your target system can run the Docker daemon, it will be able to run your containers.

If you operate your own cluster of servers (physical or virtual), you will have to provision them with the Docker daemon and also provide some configuration and/or scripting that will ensure your containers are always up and running. But this is something you would have to do anyway with any kind of software. Docker may make your life easier because every application will have the same type of deliverable—a container image—and will not require a complex installation process. The management of containers alone can be done, for instance, with systemd, a common system and service manager found in most Linux distributions.

We've discussed the topic of creating Docker images using Dockerfiles in Chapter 2, Modern Python Development Environments. You can learn more about best practices for writing Dockerfiles at https://docs.docker.com/develop/develop-images/dockerfile_best-practices/.

But not all organizations are willing to support their own infrastructure. Fortunately, many cloud providers offer various services that can take a lot of the operational burden off Docker users. At a larger scale, you can use dedicated container orchestration systems like Kubernetes (k8s). Kubernetes is a container orchestration system designed by Google. It organizes collections of application containers that should run on the same cluster node into groups called Pods. Kubernetes can manage container volumes and configuration maps, control the automated scaling of services, and manage communication within the cluster as well as incoming traffic.

You can learn more about Kubernetes at https://kubernetes.io.

Kubernetes can handle a range of container orchestration needs, from managed Kubernetes clusters where you can decide how many worker nodes you need and how to configure them, to fully serverless offerings where you simply provide Docker images with their configuration and the cloud provider takes care of scaling the infrastructure for you. Flexible on-demand pricing often means you pay only for allocated resources. This allows you to avoid large upfront infrastructure costs and to "scale as you grow."

Docker is of course not the only way for applications to be portable between hosts or service providers. But regardless of the packaging format, your application won't be portable if it isn't configurable in a system- and application-agnostic way. Let's take a look at typical configuration options for applications.

Handling environment variables

Every application will require configuration values that will vary between environments. Examples can be:

  • Connection strings (URLs), hostnames, and ports of backing services like caches, databases, proxy servers, or web APIs
  • Credentials to those services
  • Other secrets like encryption keys and client certificates
  • Per-environment values like feature toggles or resource limits

These configuration values should always be separated from the application code and definitely shouldn't be stored as constants in modules. That's especially important for values that have to be kept secret. There are multiple reasons for that:

  • The first one is security. If the code contains information about secrets and credentials, whoever gets access to the code will know them all. And if someone gets access to the code repository, they will know all current and past secrets. That poses a real security risk.
  • Another reason for decoupling configuration from applications is the volatility of environments—they come and go. On one day, you may work with just a few environments, but on another day, you may want to create more of them. What if you want to create a new short-lived environment for every feature branch you work on? What if you would like to do the same for every team member on the project? Do you really want to keep all those configurations in the same project repository?
  • Last but not least, the configuration should be language- and framework-agnostic. You will eventually use different technologies to run your software. You may change your framework or maybe even move from Python to a completely different language. You may also want to migrate from one infrastructure to another at some point in time. Today it may be a simple application running in a virtual environment on one host but tomorrow it may be a Docker container in a Kubernetes cluster. Or even some serverless function managed by your cloud provider. You never know how your application will evolve so you need to be sure that the way you provide configuration to your application is as generic as possible.

The most universal way to provide configuration to your application is through environment variables. This is a simple key-value mapping that should be supported by every operating system and every programming language. They can be easily changed without any code or file modification. They are stored only in a running process environment (which is ephemeral) so are also better suited for providing secret values to your application.

The biggest advantage of using environment variables for configuration is that they can be completely decoupled from application source code. Thanks to this, you will be able to use the same deployable artifact (like a Docker container image or Python package) in various environments and tune it just by providing new environment variable values on application startup. This approach reduces the version drift between environments and allows you to avoid the bundling of secret variables in your application packages. Also, you may eventually decide to use code written in various frameworks or even languages. Environment variables allow the same configuration medium across different technologies (as opposed to dedicated configuration files or modules).

Using environment variables is easy. If you're working on Linux, macOS, or another POSIX-compliant system, you can set a new environment variable value using the export command as in the following example:

$ export MY_VARIABLE="my-value"

In those systems, you can also set specific variables just for the scope of one command invocation. You do that by prepending a series of variables to the command:

$ VARIABLE_1="value-1" command
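The same per-invocation scoping can be reproduced from within Python by passing a modified copy of the environment to a subprocess. The following is a minimal, self-contained sketch using only the standard library (the variable name is illustrative):

```python
import os
import subprocess
import sys

# Start from a copy of the current environment so the child process
# inherits everything else unchanged.
child_env = os.environ.copy()
child_env["MY_VARIABLE"] = "my-value"

# The override exists only in the child process; the parent process'
# environment is left untouched.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['MY_VARIABLE'])"],
    env=child_env,
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # my-value
```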

On Windows, if you're using PowerShell, you can set an environment variable value through the special $env variable:

$ $env:MY_VARIABLE="my-value"

If you use CMD on Windows, you can also use the set command:

$ set MY_VARIABLE="my-value"

Environment variable names on Linux and macOS are case-sensitive, but on Windows they are case-insensitive. That's why it is a good practice to name environment variables in uppercase, the same as you would name constants in code.

As you can see, depending on the environment, there are different ways to set the environment variables. Moreover, for container-orchestration systems like Kubernetes or provider-specific cloud services, you won't be interacting with the system shell directly. You will usually be setting the desired environment values through dedicated service manifest files or the provider API.

What doesn't change between those environments is the way you read those variables. Environment variables in Python are exposed in the environ variable in the built-in os module. It is a dict-like object that allows access to and the modification of environment variables.

os.environ can be accessed at any time but the common convention is to create a single module in your application that accesses all environment variables. Thanks to this, you get a good overview of all configuration options supported by the application and are in control of all value processing and validation.

The example configuration for a small application could be as follows:

import os
from datetime import timedelta

DATABASE_URI = os.environ["DATABASE_URI"]
ENCRYPTION_KEY = os.environ["ENCRYPTION_KEY"]
BIND_HOST = os.environ.get("BIND_HOST", "localhost")
BIND_PORT = int(os.environ.get("BIND_PORT", "80"))
SCHEDULE_INTERVAL = timedelta(
    seconds=int(os.environ.get("SCHEDULE_INTERVAL_SECONDS", "50"))
)

As you can see, os.environ has a common dictionary protocol. If a given variable does not exist, item access through the [key] syntax will raise a KeyError exception. This is a common way to specify environment variables that are required and without which the application will not work.
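If several required variables are missing, a bare KeyError reports only the first one encountered. A small helper function (a sketch of one possible convention, not part of any standard library or framework) can collect all missing names and fail with a single, clearer message:

```python
import os


def require_env(*names: str) -> dict[str, str]:
    """Return the requested environment variables, raising a single
    descriptive error if any of them are missing."""
    missing = [name for name in names if name not in os.environ]
    if missing:
        raise RuntimeError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return {name: os.environ[name] for name in names}


# Simulated environment for demonstration purposes only:
os.environ["DATABASE_URI"] = "postgresql://localhost/app"
config = require_env("DATABASE_URI")
print(config["DATABASE_URI"])  # postgresql://localhost/app
```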

Analogously, the os.environ.get() method allows you to specify environment variables that are optional or can have a default value. Using defaults is a convenient way to reduce the amount of configuration required for an individual environment. Good targets for defaults are configuration values that usually stay the same for most environments but need to be overridden in specific use cases (a testing environment, for instance). From a security standpoint, defaults should reflect production values rather than development values. That prevents accidental misconfiguration in the most critical environment. Defaults should of course never store secret values.

Last but not least, some values may need conversion to specific data types. That's because environment variable values in the os.environ object are always strings. If you need a specific data type that would be more useful in your code, you have to parse and transform the string value. In the previous example, we saw the BIND_PORT value parsed into an integer and the SCHEDULE_INTERVAL_SECONDS value transformed into a timedelta object.
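Boolean flags deserve special care during such conversions: any non-empty string is truthy in Python, so bool("false") evaluates to True. The following is a hedged sketch of an explicit parser (the accepted spellings are a common convention, not any standard):

```python
import os

TRUTHY = {"1", "true", "yes", "on"}
FALSY = {"0", "false", "no", "off"}


def env_bool(name: str, default: bool = False) -> bool:
    """Parse a boolean environment variable explicitly instead of
    relying on bool(), which treats any non-empty string as True."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in TRUTHY:
        return True
    if value in FALSY:
        return False
    raise ValueError(f"Invalid boolean value for {name}: {raw!r}")


os.environ["DEBUG_MODE"] = "false"  # simulated environment
print(env_bool("DEBUG_MODE"))  # False
print(bool(os.environ["DEBUG_MODE"]))  # True - the naive pitfall
```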

If the amount of environment variables grows, it may be sensible to pack them into a common configuration object that can automate value parsing and bring more structure to the configuration. The Python standard library lacks such a feature but there are plenty of utilities on PyPI that help with handling environment variables.

One such utility is the environ-config package. It allows for automatic prefixing of environment variables and grouping them in descriptive sections. It offers easy validation and transformation of the values. The core of the environ-config package is the environ.config() class decorator and environ.var() descriptor. They are used to define configuration classes that can read values directly from the os.environ object. The following is a reimplementation of the previous configuration module with the usage of the environ-config package:

from datetime import timedelta
import environ
@environ.config(prefix="")
class Config:
    @environ.config()
    class Bind:
        host = environ.var(default="localhost")
        port = environ.var(default="80", converter=int)
    bind = environ.group(Bind)
    database_uri = environ.var()
    encryption_key = environ.var()
    schedule_interval = environ.var(
        name="SCHEDULE_INTERVAL_SECONDS",
        converter=lambda value: timedelta(seconds=int(value)),
        default=50
    )

In order to actually create a configuration object, you can use Config.from_environ() as in the following example:

>>> config = Config.from_environ()
>>> config.bind
Config.Bind(host='localhost', port=80)
>>> config.bind.host
'localhost'
>>> config.schedule_interval
datetime.timedelta(seconds=50)

The configuration classes decorated with the environ.config() decorator will automatically look for environment variables by transforming their attribute names into uppercase. So the config.database_uri attribute is related directly to the DATABASE_URI environment variable. But sometimes you may want to use a specific name instead of an auto-generated one. You can do that easily by providing the name keyword argument to the environ.var() descriptor. We see an example of such usage in the definition of the schedule_interval attribute.

The definition of the Config.Bind class and usage of the environ.group() descriptor show how configurations can be nested. The environ-config package is smart enough to prefix requested environment variable names with the name of the group attribute. It means that the Config.bind.host attribute relates to the BIND_HOST environment variable and the Config.bind.port attribute relates to the BIND_PORT environment variable.

But the most useful feature of the environ-config package is the ability to conveniently handle the conversion and validation of environment variables. That can be done with the converter keyword argument. It can be either a type constructor, as in the Config.bind.port example, or a custom function that takes one positional string argument.

The common technique is to use one-off lambda functions as in the Config.schedule_interval example. Usually, the converter argument is just enough to ensure that the variable has the correct type and value. If that's not enough, you can provide an additional validator keyword argument. It should be a callable that receives the output of the converter function and returns the final result.

The role of environment variables in application frameworks

The role of environment variables within application frameworks that have a dedicated configuration file or module layout can be unclear. A prime example of such a framework is Django, which comes with the popular settings.py module. In Django, every application has a settings.py module that contains a collection of various runtime configuration variables. It serves two purposes:

  • Statement of application structure within the framework: Django applications are a composition of various components: apps, views, middlewares, templates, context processors, and so on. The settings.py file is a manifest of all installed apps, used components, and a declaration of their configuration. Most of this configuration is independent of the environment in which the application runs. In other words, it is an integral part of the application.
  • Definition of runtime configuration: The settings.py module is a convenient way to provide environment-specific values that need to be accessed by application components during the application runtime. It is thus a common medium for application configuration.

Having the framework-specific statement of application structure inside the code repository of your application code is something normal. It is indeed part of the application code. Problems arise when this settings.py file holds explicit values for the actual environments where an application is supposed to be deployed.

The common convention among some Django developers is to define multiple settings modules to store project configuration. Those settings modules can be quite large, so usually there is one base settings.py file that holds common configurations and multiple per-environment modules that override specific values (see Figure 11.1).

Figure 11.1: Typical layout of settings modules in many Django applications

This design is quite simple, and Django actually supports it out of the box. The Django application will read the value of the DJANGO_SETTINGS_MODULE environment variable on startup to decide which settings module to import. That's why this pattern is so popular.

Although using multiple per-environment settings modules is simple and popular, it has multiple drawbacks:

  • Configuration indirection: Every settings module has to either preserve a copy of common values or import values from a shared common file. Usually, it is the latter. Then, if you want to check what the actual configuration of a specific environment is, you have to read both modules.

    In rare situations, developers decide to import parts of configuration between specific environments. In such situations, inspecting the configuration becomes a nightmare.

  • Adding a new environment requires code change: Settings modules are Python code and thus will be tied to the application code. Whenever you need to create a completely new environment, you will have to modify the code.
  • Modifying configuration requires a new repackaging of application: Whenever you modify the code for a configuration change, you need to create a new deployable artifact. A common practice in deployment methodologies is to promote every new version of an application through multiple environments. The common progression is:

    development → testing → staging → production

    With multiple settings modules, a single change for one environment configuration may necessitate redeployments in unaffected environments. This creates operational overhead.

  • A single application holds configurations for all environments: This can pose a security risk if one environment is less secure than others. For instance, an attacker obtaining access to a development environment may gain more information about the possible attack surface in the production environment. This becomes even more problematic if secret values are stored in the configuration.
  • The problem of secret values: Secrets should not be stored on a filesystem and definitely should not be put into the code. Django applications using per-environment settings modules usually read secrets from environment variables anyway (or communicate with dedicated password managers).

We used Django as an example of an application framework because it is extremely popular. But it's not the only framework that has the notion of settings modules and not the only framework where the pattern of multiple per-environment settings modules occurs.

Those frameworks often can't run without their settings modules. That's because settings modules are not only about environment-specific configuration but also about the composition of your application. It means that you cannot easily replace them with a set of environment variables. It would also be very inconvenient as many application-defining values often have to be provided as lists, dictionaries, or specialized data types.

But there is some middle ground. You can have an application that has a dedicated settings module but is still able to satisfy the twelve-factor rule about storing configuration in an environment. This can be achieved by following a few basic principles:

  • Use only one settings module: A settings module should be a statement of the application structure and default behavior (timeout values, for instance) that is completely independent of the environment. In other words, if a specific value never changes between environments, you can put it safely in the settings module.
  • Use environment variables for environment-specific values: If a value changes between environments, it can be exposed as a variable within the settings module, but it should always be read from environment variables. You can still be pragmatic and use defaults in situations where a value needs to be overridden only in very specific circumstances. An example could be a debugging flag that usually is enabled in development environments but rarely in others.
  • Use production defaults: If a configuration variable has a default value, it is easy to miss it when configuring a specific environment. If you decide to use default values for specific configuration variables, always make sure that default values are the ones that can be safely used in a production environment. Examples of values that should be considered with great care are authentication/authorization settings or feature toggles that enable/disable experimental features. By using production defaults, you are shielding your environment from accidental misconfiguration.
  • Never put secrets into settings modules: Secrets can be exposed as variables via a settings module (for instance, by reading them from environment variables) but should never be put there in plain-text format.
  • Do not expose an environment label to an application: An application should be aware of its environment only through the qualities it can experience—specific configuration variables. It should never decide how to behave based on a specific label (development, staging, production, and so on) that you attach to the environment. The only acceptable use case for providing an environment label to the application is providing context to logging and telemetry utilities.

    We will talk more about logging and telemetry (including environment labeling) in Chapter 12, Observing Application Behavior and Performance.

  • Avoid per-environment .env files in your repository: There's a common practice of writing down environment variables into so-called .env files. Those variables can be later exported through a shell script or read directly inside of the settings module. Simply avoid the urge to follow the practice to provide per-environment .env files inside of your code repository. It has all the downsides of per-environment settings modules and only increases the amount of configuration indirection.

There's one acceptable use case for .env files: providing a configuration template for local development purposes that developers can use to quickly set up their own unique local development environments. Tools for local development like Docker Compose can understand .env files and export their values to an application container. Still, this practice should never be expanded to other environments. Also, it is better to use a scripting layer (or Docker Compose support) to export .env files as real environment variables than to use dedicated libraries that read those files directly from the filesystem.
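Such a local-development template (often committed under a name like .env.example) might look like the following; all names and values here are purely illustrative:

```
# .env.example - template for local development only.
# Copy to .env and adjust the values; never commit a filled-in .env file.
DATABASE_URI=postgresql://localhost:5432/app_dev
ENCRYPTION_KEY=replace-me-locally
BIND_HOST=localhost
BIND_PORT=8080
```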

The above set of principles is a pragmatic tradeoff between pure environment-based configuration and classic settings modules. Environment variables can be conveniently read through the os.environ object, environ-config package, or any other dedicated utility.
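Putting these principles together, a minimal settings module could look as follows. This is only an illustrative sketch; the variable names and the simulated environment are assumptions, not tied to any specific framework:

```python
import os

# Simulate a configured environment for demonstration purposes only;
# in a real deployment these would be set by the platform.
os.environ.setdefault("DATABASE_URI", "postgresql://localhost/app")
os.environ.setdefault("SECRET_KEY", "dummy-value-for-local-runs")

# Application structure and environment-independent defaults
# live directly in the settings module:
CACHE_TIMEOUT_SECONDS = 300

# Environment-specific values are always read from environment
# variables; secrets are required and have no defaults:
DATABASE_URI = os.environ["DATABASE_URI"]
SECRET_KEY = os.environ["SECRET_KEY"]

# Production-safe default: debugging is off unless explicitly enabled.
DEBUG = os.environ.get("DEBUG", "false").strip().lower() in {"1", "true", "yes"}
```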

This methodology of course requires some experience in deciding which values would be environment-specific. It thus doesn't guarantee that you will never have to modify the code just to reconfigure a specific environment. The need to modify the code just for the sake of configuration change will definitely happen more often if you decide to heavily rely on default values. That's why it is usually better to avoid defaults if the value for a specific variable can be different in at least one environment.

Packaging applications that need to run on remote servers concentrates on isolation, configurability, and repeatability. Usually, we have full control over the servers and infrastructure where our code runs and can build dedicated architectures, such as container orchestration systems, to support and simplify the whole packaging process. However, things change dramatically when you are not the owner or administrator of the target environment and need your users to install or run your application themselves. This is a common case for desktop applications that are installed on users' personal computers. In such a situation, we usually build standalone executables that operate like any other standalone application. Let's see how to build such executables for Python.

Creating standalone executables

Creating standalone executables is a commonly overlooked topic in materials that cover the packaging of Python code. This is mainly because Python lacks proper tools in its standard library that could allow programmers to create simple executables that could be run by users without the need to install the Python interpreter.

Compiled languages have a big advantage over Python in that they allow you to create an executable application for the given system architecture that could be run by users in a way that does not require them to have any knowledge of the underlying technology. Python code, when distributed as a package, requires the Python interpreter in order to be run. This creates a big inconvenience for users who do not have enough technical proficiency.

Developer-friendly operating systems, such as macOS or most Linux distributions, come with the Python interpreter preinstalled. So, for their users, a Python-based application could still be distributed as a source package that relies on a specific interpreter directive in the main script file, popularly called a shebang. For most Python applications, this takes the following form:

#!/usr/bin/env python

Such a directive, when used as the first line of a script, marks it to be interpreted by the default Python version for the given environment. This can, of course, take a more detailed form that requires a specific Python version, such as python3.9, python3, python2, and so on. Note that this will work in most popular POSIX systems but isn't portable at all. This solution relies on the existence of specific Python versions and also on the availability of the env executable exactly at /usr/bin/env. Both of these assumptions may fail on some operating systems. Also, a shebang will not work on Windows at all. Additionally, the bootstrapping of the Python environment on Windows can be a challenge even for developers, so you cannot expect nontechnical users to be able to do that by themselves.

The other thing to consider is the simple user experience in the desktop environment. Users usually expect applications to be run from the desktop by simply double-clicking on the executable file or the shortcut to the application. Not every desktop environment will support that with Python applications distributed in source form.

So, it would be best if we were able to create a binary distribution that works like any other compiled executable. Fortunately, it is possible to create an executable that has both the Python interpreter and our project embedded. This allows users to run our application without caring about Python or any other dependency.

Let's see some specific use cases for standalone executables.

When standalone executables are useful

Standalone executables are useful in situations where the simplicity of the user experience is more important than the user's ability to interfere with the application's code.

Note that the fact that you are distributing applications as executables only makes code reading or modification harder, not impossible. It is not a way to secure application code and should only be used as a way to make interacting with the application simpler.

Standalone executables are the preferred way of distributing applications to non-technical end users, and they also seem to be the only reasonable way of distributing any Python application for Windows.

Standalone executables are usually a good choice for the following:

  • Applications that depend on specific Python versions that may not be easily available on the target operating systems
  • Applications that rely on modified precompiled CPython sources
  • Applications with graphical interfaces
  • Projects that have many binary extensions written in different languages
  • Games

Creating Python executables may not be straightforward but there are some tools that may ease the process. Let's take a look at some popular choices.

Popular tools

Python does not have any built-in support for building standalone executables. Fortunately, there are some community projects solving that problem with varying degrees of success. The following four are the most notable:

  • PyInstaller
  • cx_Freeze
  • py2exe
  • py2app

Each of them differs slightly in usage, and each has slightly different limitations. Before choosing your tool, you need to decide which platforms you want to target, because every packaging tool supports only a specific set of operating systems.

It is best if you make such a decision at the very beginning of the project's life. Although none of these tools requires complex integration in your code, if you start building standalone packages early, you can automate the whole process and definitely save some future development time. If you leave this for later, you may find yourself in a situation where the project is built in such a sophisticated way that none of the available tools will work out of the box. Providing a standalone executable for such a project will be problematic and will take a lot of effort.

Let's take a look at PyInstaller in the next section.

PyInstaller

PyInstaller is by far the most advanced program for freezing Python packages into standalone executables. It provides the most extensive multiplatform compatibility of all the solutions available at the moment, so it is the most highly recommended one. PyInstaller supports the following platforms:

  • Windows (32-bit and 64-bit)
  • Linux (32-bit and 64-bit)
  • macOS (32-bit and 64-bit)
  • FreeBSD, Solaris, and AIX

The documentation for PyInstaller can be found at http://www.pyinstaller.org/.

At the time of writing, the latest version of PyInstaller supports all Python versions from 3.5 to 3.9. It is available on PyPI, so it can be installed in your working environment using pip. If you have problems installing it this way, you can always download the installer from the project's page.

Unfortunately, cross-platform building (cross-compilation) is not supported, so if you want to build your standalone executable for a specific platform, then you need to perform building on that platform. This is not a big problem today with the advent of many virtualization tools. If you don't have a specific system installed on your computer, you can always use VirtualBox or a similar system virtualization tool, which will provide you with the desired operating system as a virtual machine.
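Before looking at the tool's invocation, it helps to have a concrete script to bundle. The walkthrough below assumes a minimal hello world application; its contents are not shown in the original example, but it could be as simple as:

```python
# myscript.py - a minimal "hello world" application used for bundling.

def main():
    print("Hello, world!")

if __name__ == "__main__":
    main()
```

Any script with a regular entry point like this can be handed to the bundling tools discussed in this section.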

Usage for simple applications is pretty straightforward. Let's assume our application is contained in a script named myscript.py, a simple hello world application. We want to create a standalone executable for Windows users, and our sources are located under D:\dev\app in the filesystem. Our application can be bundled with the following short command:

$ pyinstaller myscript.py

The output you will see may be as follows:

2121 INFO: PyInstaller: 3.1
2121 INFO: Python: 3.9.2
2121 INFO: Platform: Windows-7-6.1.7601-SP1
2121 INFO: wrote D:\dev\app\myscript.spec
2137 INFO: UPX is not available.
2138 INFO: Extending PYTHONPATH with paths ['D:\\dev\\app', 'D:\\dev\\app']
2138 INFO: checking Analysis
2138 INFO: Building Analysis because out00-Analysis.toc is non existent
2138 INFO: Initializing module dependency graph...
2154 INFO: Initializing module graph hooks...
2325 INFO: running Analysis out00-Analysis.toc
(...)
25884 INFO: Updating resource type 24 name 2 language 1033 

PyInstaller's standard output is quite long, even for simple applications, so it has been truncated in the preceding example for the sake of brevity. On Windows, the resulting structure of directories and files created by PyInstaller may look as follows:

project/
├── myscript.py
├── myscript.spec
├───build/
│   └───myscript/
│       ├── myscript.exe
│       ├── myscript.exe.manifest
│       ├── out00-Analysis.toc
│       ├── out00-COLLECT.toc
│       ├── out00-EXE.toc
│       ├── out00-PKG.pkg
│       ├── out00-PKG.toc
│       ├── out00-PYZ.pyz
│       ├── out00-PYZ.toc
│       └── warnmyscript.txt
└───dist/
    └───myscript/
        ├── bz2.pyd
        ├── Microsoft.VC90.CRT.manifest
        ├── msvcm90.dll
        ├── msvcp90.dll
        ├── msvcr90.dll
        ├── myscript.exe
        ├── myscript.exe.manifest
        ├── python39.dll
        ├── select.pyd
        ├── unicodedata.pyd
        └── _hashlib.pyd

The dist/myscript directory contains the built application that can now be distributed to users. Note that the whole directory must be distributed. It contains all the additional files that are required to run our application (DLLs, compiled extension libraries, and so on). A more compact distribution can be obtained with the --onefile switch of the pyinstaller command as follows:

$ pyinstaller --onefile myscript.py

The resulting file structure will then look as follows:

project/
├── myscript.py
├── myscript.spec
├───build
│   └───myscript
│       ├── myscript.exe
│       ├── myscript.exe.manifest
│       ├── out00-Analysis.toc
│       ├── out00-COLLECT.toc
│       ├── out00-EXE.toc
│       ├── out00-PKG.pkg
│       ├── out00-PKG.toc
│       ├── out00-PYZ.pyz
│       ├── out00-PYZ.toc
│       └── warnmyscript.txt
└───dist/
    └── myscript.exe

When built with the --onefile option, the only file you need to distribute to other users is the single executable found in the dist directory (here, myscript.exe). For small applications, this is probably the preferred option.
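One practical consequence of bundling is that the application can no longer locate data files relative to its source tree. PyInstaller documents two runtime hints for this: it sets the sys.frozen attribute on the embedded interpreter and, in --onefile mode, unpacks bundled resources into a temporary directory exposed as sys._MEIPASS. A sketch of a helper using these hints (the development-mode fallback to the current working directory is a simplifying assumption of this example) might look as follows:

```python
import sys
from pathlib import Path

def resource_path(relative):
    """Resolve a bundled data file both in development and when frozen.

    PyInstaller sets sys.frozen on the embedded interpreter; in
    --onefile mode, bundled files are unpacked into the temporary
    directory exposed as sys._MEIPASS.
    """
    if getattr(sys, "frozen", False):
        base = Path(getattr(sys, "_MEIPASS", Path(sys.executable).parent))
    else:
        # Development mode: resolve relative to the working directory.
        base = Path.cwd()
    return base / relative
```

Calling resource_path("data/config.ini") then yields a usable path in both the frozen and unfrozen cases.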

One of the side effects of running the pyinstaller command is the creation of a *.spec file. This is an auto-generated Python module that specifies how to create executables from your sources. Here is the example specification file created automatically for the myscript.py code:

# -*- mode: python -*- 
 
block_cipher = None 
 
 
a = Analysis(['myscript.py'], 
             pathex=['D:\\dev\\app'], 
             binaries=None, 
             datas=None, 
             hiddenimports=[], 
             hookspath=[], 
             runtime_hooks=[], 
             excludes=[], 
             win_no_prefer_redirects=False, 
             win_private_assemblies=False, 
             cipher=block_cipher) 
pyz = PYZ(a.pure, a.zipped_data, 
             cipher=block_cipher) 
exe = EXE(pyz, 
          a.scripts, 
          a.binaries, 
          a.zipfiles, 
          a.datas, 
          name='myscript', 
          debug=False, 
          strip=False, 
          upx=True, 
          console=True ) 

This .spec file contains all the pyinstaller arguments specified earlier. This is very useful if you have performed a lot of customizations on your build. Once created, you can use it as an argument to the pyinstaller command instead of your Python script as follows:

$ pyinstaller.exe myscript.spec

Note that this is a real Python module, so you can extend it and perform more complex customizations to the build procedure. Customizing the .spec file is especially useful when you are targeting many different platforms. Also, not all of the pyinstaller options are available through the command-line interface. The .spec file allows you to use every possible PyInstaller feature.
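As one example of such a customization, the Analysis call can be edited in place to bundle data files alongside the executable. This is only an illustrative fragment of a hand-edited myscript.spec; the "assets" directory name is an assumption for the example, and each datas entry is a (source path, destination path) pair:

```python
# Fragment of a hand-edited myscript.spec: bundle the contents of a
# local "assets" directory into an "assets" folder next to the
# executable. Analysis and block_cipher come from the generated file.
a = Analysis(
    ['myscript.py'],
    pathex=['D:\\dev\\app'],
    binaries=None,
    datas=[('assets', 'assets')],
    hiddenimports=[],
    hookspath=[],
    runtime_hooks=[],
    excludes=[],
    win_no_prefer_redirects=False,
    win_private_assemblies=False,
    cipher=block_cipher,
)
```

Rebuilding with pyinstaller myscript.spec then includes the data files in the distribution.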

PyInstaller is an extensive tool that is suitable for the great majority of programs. Nevertheless, a thorough reading of its documentation is recommended if you intend to use it to distribute your applications.

Let's take a look at cx_Freeze in the next section.

cx_Freeze

cx_Freeze is another tool for creating standalone executables. It is a simpler solution than PyInstaller, but it still supports the following three major platforms:

  • Windows
  • Linux
  • macOS

Documentation for cx_Freeze can be found at https://cx-freeze.readthedocs.io.

At the time of writing, the latest version of cx_Freeze supports all Python versions from 3.6 to 3.9. It is available on PyPI, so it can be installed in your working environment using pip.

Similar to PyInstaller, cx_Freeze does not allow you to perform cross-platform builds, so you need to create your executables on the same operating system you are distributing to. The major disadvantage of cx_Freeze is that it does not allow you to create real single-file executables. Applications built with it need to be distributed with related DLL files and libraries.

Let's assume that we want to package a Python application for Windows with cx_Freeze. The minimal example usage is very simple and requires only one command:

$ cxfreeze myscript.py

The output you will see may be as follows:

copying C:\Python39\lib\site-packages\cx_Freeze\bases\Console.exe -> D:\dev\app\dist\myscript.exe
copying C:\Windows\system32\python39.dll ->
D:\dev\app\dist\python39.dll
writing zip file D:\dev\app\dist\myscript.exe
(...)
copying C:\Python39\DLLs\bz2.pyd -> D:\dev\app\dist\bz2.pyd
copying C:\Python39\DLLs\unicodedata.pyd -> D:\dev\app\dist\unicodedata.pyd

The resulting structure of the files may be as follows:

project/
├── myscript.py
└── dist/
    ├── bz2.pyd
    ├── myscript.exe
    ├── python39.dll
    └── unicodedata.pyd

Instead of providing its own format for build specification (like PyInstaller does), cx_Freeze extends the distutils package. This means you can configure how your standalone executable is built with the familiar setup.py script. This makes cx_Freeze very convenient if you already distribute your package using setuptools or distutils because additional integration requires only small changes to your setup.py script. Here is an example of such a setup.py script using cx_Freeze.setup() for creating standalone executables on Windows:

import sys
from cx_Freeze import setup, Executable
# Dependencies are automatically detected, 
# but it might need fine tuning.
build_exe_options = {"packages": ["os"], "excludes": ["tkinter"]}
setup(
    name="myscript",
    version="0.0.1",
    description="My Hello World application!",
    options={
        "build_exe": build_exe_options
    },
    executables=[Executable("myscript.py")]
)

With such a file, the new executable can be created using the new build_exe command added to the setup.py script as follows:

$ python setup.py build_exe

The usage of cx_Freeze may seem a bit more Pythonic than PyInstaller, thanks to the distutils integration. Unfortunately, this project may cause some trouble for inexperienced developers due to the following reasons:

  • Installation using pip may be problematic under Windows
  • The official documentation is very brief and lacking in some places

cx_Freeze is not the only tool for creating Python executables that integrates with distutils. Two notable examples are py2exe and py2app, which are described in the next section.

py2exe and py2app

py2exe (http://www.py2exe.org/) and py2app (https://py2app.readthedocs.io/en/latest/) are two complementary programs that integrate with Python packaging either via distutils or setuptools in order to create standalone executables. They are mentioned here together because they are very similar in both usage and limitations. The major drawback of py2exe and py2app is that each of them targets only a single platform: py2exe can build executables only for Windows, while py2app supports only macOS.

Because the usage is very similar and requires only modification of the setup.py script, these packages complement each other. The documentation of the py2app project provides the following example of the setup.py script, which allows you to build standalone executables with the right tool (either py2exe or py2app) depending on the platform used:

import sys 
from setuptools import setup 
 
mainscript = 'MyApplication.py' 
 
if sys.platform == 'darwin': 
    extra_options = dict( 
        setup_requires=['py2app'], 
        app=[mainscript], 
        # Cross-platform applications generally expect sys.argv to 
        # be used for opening files. 
        options=dict(py2app=dict(argv_emulation=True)), 
    ) 
elif sys.platform == 'win32': 
    extra_options = dict( 
        setup_requires=['py2exe'], 
        app=[mainscript], 
    ) 
else: 
    extra_options = dict( 
        # Normally unix-like platforms will use "setup.py install" 
        # and install the main script as such 
        scripts=[mainscript], 
    ) 
 
setup( 
    name="MyApplication", 
    **extra_options 
)

With such a script, you can build your Windows executable using the python setup.py py2exe command and macOS app using python setup.py py2app. Cross-compilation is, of course, not possible.

Although py2app and py2exe have obvious limitations and offer less flexibility than PyInstaller or cx_Freeze, it is always good to be familiar with them. In some cases, PyInstaller or cx_Freeze might fail to build an executable for a project properly. In such situations, it is always worth checking whether the other solutions can handle your code.

Security of Python code in executable packages

It is important to know that standalone executables do not make the application code secure by any means. In fact, there is no reliable way to secure applications from decompilation with the tools available today, and while it is not an easy task to decompile embedded code from executable files, it is definitely doable. What is even more important is that the results of such decompilation (if done with the proper tools) might look strikingly similar to original sources.

Still, there are some ways to make the decompilation process harder.

It's important to note that harder does not mean less probable. For some programmers, the hardest challenges are the most tempting ones. And the eventual prize in this challenge is very high—the code that you tried to keep secret.

Usually, the process of decompilation consists of the following steps:

  1. Extracting the project's binary representation of bytecode from standalone executables
  2. Mapping a binary representation to the bytecode of a specific Python version
  3. Translating bytecode to AST
  4. Re-creating sources directly from AST
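The first two steps are feasible precisely because a frozen executable ships ordinary CPython code objects, which the standard library itself can serialize and revive without loss. This illustrative sketch performs that round trip with the marshal module (the same binary format CPython uses for cached bytecode):

```python
import marshal

# Compile a snippet to a code object, serialize it to its binary
# bytecode form, then revive and execute it. The round trip loses
# nothing, which is why decompilers can reconstruct so much.
source = "def add(a, b):\n    return a + b\n"
code = compile(source, "<embedded>", "exec")

payload = marshal.dumps(code)    # binary bytecode, as stored on disk
revived = marshal.loads(payload) # steps 1-2: back to a code object

namespace = {}
exec(revived, namespace)
print(namespace["add"](2, 3))  # → 5
```

From a revived code object, tools can proceed to steps 3 and 4 and rebuild readable source.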

Providing exact solutions for deterring developers from such reverse engineering of standalone executables would be pointless for obvious reasons: determined developers will do it anyway. So here are only some ideas for hampering the decompilation process or devaluing its results:

  • Removing any code metadata available at runtime (docstrings) so the eventual results will be a bit less readable.
  • Modifying the bytecode values used by the CPython interpreter, so conversion from binary to bytecode, and later to AST, requires more effort.
  • Using a version of CPython sources modified in such a complex way that even if decompiled sources of the application are available, they are useless without decompiling the modified CPython binary.
  • Using obfuscation scripts on sources before bundling them into an executable, which will make sources less valuable after the decompilation.
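The first of these ideas requires no third-party tooling: CPython itself can discard docstrings at compile time via the -OO interpreter flag, which corresponds to the optimize=2 argument of the built-in compile(). The following sketch compares the two modes:

```python
source = '''
def greet(name):
    """Return a friendly greeting."""
    return f"Hello, {name}!"
'''

# optimize=0 is the default; optimize=2 corresponds to python -OO,
# which removes assert statements and discards docstrings from the
# compiled bytecode.
plain = {}
exec(compile(source, "<src>", "exec", optimize=0), plain)

stripped = {}
exec(compile(source, "<src>", "exec", optimize=2), stripped)

print(plain["greet"].__doc__)     # the original docstring survives
print(stripped["greet"].__doc__)  # None - nothing left to recover
```

This only removes metadata; the logic itself still decompiles cleanly, which is exactly why it merely devalues the result rather than protecting it.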

Such solutions make the development process a lot harder. Some of the preceding ideas require a very deep understanding of the Python runtime, and each of them is riddled with pitfalls and disadvantages. Mostly, they only delay the inevitable. Once your trick is broken, all your additional effort becomes a waste of time. This means that standalone Python executables are not a viable solution for closed-source projects where leaking the application code could harm the organization.

The only reliable way to not allow your closed code to leak outside of your application is to not ship it directly to users in any form. And this is only possible if other aspects of your organization's security stay airtight (using strong multi-factor authentication, encrypted traffic, and a VPN to start with). So, if your whole business can be copied simply by copying the source code of your application, then you should think of other ways to distribute the application. Maybe providing software as a service would be a better choice for you.

Summary

In this chapter, we have discussed various ways of packaging Python libraries and applications including applications for SaaS/cloud environments as well as desktop applications. Now you should have a general idea about possible packaging tools and strategies for distributing your project. You should also know popular techniques for common problems and how to provide useful metadata to your project.

On our way, we've learned about the importance of the packaging ecosystem and details of publishing Python package distributions on package indexes. We've seen that standard distribution scripts (the setup.py files) can be useful even when not publishing code directly to PyPI.

The real fun begins when your code is made available to its users. No matter how well it is tested and how well it is designed, you will find that your application does not always behave as expected. People will report problems. You will have performance issues. Some things will inevitably go wrong.

To solve those issues, you will need a lot of information to replicate user errors and understand what has really happened. Wise developers are always prepared for the unexpected and know how to actively collect data that helps in diagnosing problems and allows you to anticipate future failures. That will be the topic of the next chapter.
